Conversation

suyoggupta commented on Jul 27, 2025

@coderabbitai summary

pytorch:

                                                                                                                                                
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     71.5680
Total Output Throughput (tokens/sec):             9160.7029
Total Token Throughput (tokens/sec):              18321.4059
Total Latency (ms):                               3577.0181
Average request latency (ms):                     3504.4893
Per User Output Throughput [w/ ctx] (tps/user):   36.5284
Per GPU Output Throughput (tps/gpu):              9160.7029

AutoDeploy:

                                                                                                                                                
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     67.6845
Total Output Throughput (tokens/sec):             8663.6133
Total Token Throughput (tokens/sec):              17327.2265
Total Latency (ms):                               3782.2556
Average request latency (ms):                     3704.9923
Per User Output Throughput [w/ ctx] (tps/user):   34.5515
Per GPU Output Throughput (tps/gpu):              8663.6133

Net of the comparison: AutoDeploy is within about 5-6% of the pytorch backend on this run (~5.4% lower throughput, ~5.7% higher total latency).

Copilot AI review requested due to automatic review settings July 27, 2025 08:02

Copilot AI left a comment

Pull Request Overview

This PR cuts host-device round trips when processing new tokens in the auto deploy executor. Instead of copying new tokens from GPU to CPU and back, the changes keep the new tokens on the GPU throughout the processing pipeline.

  • Replace CPU conversion of new_tokens with direct GPU tensor handling
  • Update the sequence information interface to accept GPU tensors directly
  • Add NVTX profiling ranges for performance monitoring
  • Optimize tensor creation with pinned memory for faster transfers (see the sketch after this list)
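A minimal sketch of these patterns, not the PR's actual code: scatter sampled tokens into a device-resident buffer, use pinned memory where a host staging copy is unavoidable, and bracket the hot path with an NVTX range. Shapes and names (new_tokens as [num_beams, batch_size, 1], a flat input_ids buffer with -1 marking empty slots) are assumptions inferred from the review comments below; it requires a CUDA device.

```python
import torch

device = torch.device("cuda")
batch_size, max_num_tokens = 4, 16

# Assumed shape: [num_beams, batch_size, 1], produced by the sampler on GPU.
new_tokens = torch.randint(0, 32000, (1, batch_size, 1), device=device)
input_ids = torch.ones(max_num_tokens, dtype=torch.int, device=device)
input_ids[:batch_size] = -1  # slots awaiting newly sampled tokens

# Before: new_tokens.cpu() round-tripped through host memory every step,
# forcing a device sync. After: scatter directly on the GPU instead.
previous_batch_indices = torch.arange(batch_size, device=device)
with torch.cuda.nvtx.range("update_input_ids"):  # shows up in Nsight Systems
    input_ids[input_ids == -1] = new_tokens[0, previous_batch_indices, 0].to(input_ids.dtype)

# Where a host buffer is still needed, pinned memory enables an async
# host-to-device copy that can overlap with compute.
host_buf = torch.empty(batch_size, dtype=torch.int, pin_memory=True)
dev_buf = host_buf.to(device, non_blocking=True)
```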

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py | Removes CPU conversion of new_tokens and passes GPU tensors directly to the sequence interface |
| tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | Updates the sequence interface to handle GPU tensors and adds performance optimizations |

max_num_tokens=max_num_tokens,
)

print(" in seq_info for device: ", torch.cuda.current_device())

Copilot AI Jul 27, 2025

Debug print statement should be removed before merging to production. This appears to be leftover debugging code.

Suggested change
- print(" in seq_info for device: ", torch.cuda.current_device())
+ ad_logger.info(f"in seq_info for device: {torch.cuda.current_device()}")

Comment on lines +420 to +421
self.input_ids[self.input_ids == -1] = new_tokens[0,previous_batch_indices,0]


Copilot AI Jul 27, 2025

This indexing assumes new_tokens has at least 3 dimensions and previous_batch_indices is valid, but there's no bounds checking. If previous_batch_indices contains invalid indices or new_tokens has different dimensions, this will cause a runtime error.

Suggested change
- self.input_ids[self.input_ids == -1] = new_tokens[0,previous_batch_indices,0]
+ # Validate new_tokens dimensions
+ if new_tokens.dim() < 3:
+     raise ValueError(f"new_tokens must have at least 3 dimensions, but got {new_tokens.dim()}.")
+ # Validate previous_batch_indices
+ max_index = new_tokens.size(1) - 1
+ if any(idx < 0 or idx > max_index for idx in previous_batch_indices):
+     raise IndexError(f"previous_batch_indices contains out-of-bounds indices for new_tokens' second dimension (valid range: 0 to {max_index}).")
+ self.input_ids[self.input_ids == -1] = new_tokens[0, previous_batch_indices, 0]
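One caveat on the suggestion above: if previous_batch_indices is a GPU tensor rather than a Python list, the any(...) generator pulls elements across the device boundary one at a time. A torch-native sketch (assuming a torch.Tensor input; the actual type in this code path isn't shown in the diff) keeps the reduction on device:

```python
import torch

def validate_batch_indices(previous_batch_indices: torch.Tensor, max_index: int) -> None:
    """Bounds-check indices with device-side reductions (hypothetical helper)."""
    if previous_batch_indices.numel() == 0:
        return
    # min()/max() run on the device; int(...) costs at most two host syncs,
    # versus one sync per element with any(... for idx in tensor).
    lo, hi = int(previous_batch_indices.min()), int(previous_batch_indices.max())
    if lo < 0 or hi > max_index:
        raise IndexError(
            f"previous_batch_indices out of bounds (min={lo}, max={hi}, "
            f"valid range: 0 to {max_index})."
        )
```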

Comment on lines +125 to 128
self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
self.input_pos = torch.empty_like(self.seq_len)

Copilot AI Jul 27, 2025

The tensor is created with device=self.device, but self.device is a property that depends on self.input_pos.device. At this point in __post_init__, input_pos hasn't been initialized yet, which could cause an AttributeError.

Suggested change
- self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
- self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
- self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
- self.input_pos = torch.empty_like(self.seq_len)
+ self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
+ self.input_pos = torch.empty_like(self.seq_len)
+ self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
+ self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
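For context on why the reordering matters, here is a plausible shape of the device property (an assumption; the actual definition in attention_interface.py isn't shown in this diff). It derives the device from input_pos, so any tensor built with device=self.device must be created after input_pos exists:

```python
import torch
from dataclasses import dataclass

@dataclass
class SequenceInfo:  # hypothetical, trimmed to the fields under discussion
    max_batch_size: int = 8
    max_num_tokens: int = 64

    def __post_init__(self) -> None:
        # seq_len/input_pos first: self.device reads input_pos.device below.
        self.seq_len = torch.empty(self.max_batch_size, dtype=torch.int)
        self.input_pos = torch.empty_like(self.seq_len)
        self.input_ids = torch.ones(self.max_num_tokens, dtype=torch.int, device=self.device)
        self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)

    @property
    def device(self) -> torch.device:
        # Assumed definition, per the review comment above.
        return self.input_pos.device

info = SequenceInfo()
assert info.input_ids.device == info.device
```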
